27 research outputs found

    A Similarity Measure for GPU Kernel Subgraph Matching

    Full text link
    Accelerator architectures specialize in executing SIMD (single instruction, multiple data) operations in lockstep. Because the majority of CUDA applications are parallelized loops, control flow information can provide an in-depth characterization of a kernel. CUDAflow is a tool that statically separates CUDA binaries into basic block regions and dynamically measures instruction and basic block frequencies. CUDAflow captures this information in a control flow graph (CFG) and performs subgraph matching across various kernels' CFGs to gain insights into an application's resource requirements, based on the shape and traversal of the graph, the instruction operations executed, and the registers allocated, among other information. The utility of CUDAflow is demonstrated with SHOC and Rodinia application case studies on a variety of GPU architectures, revealing novel thread divergence characteristics that facilitate end users, autotuners, and compilers in generating high-performing code.
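The idea of matching kernels by CFG shape can be illustrated with a minimal sketch. This is not CUDAflow's actual similarity measure (which also weighs instruction mix, register allocation, and traversal); it only shows the simplest form of graph comparison, a Jaccard similarity over edge sets, on two hypothetical toy kernels:

```python
# Hypothetical sketch: compare two kernel control flow graphs (CFGs),
# given as sets of (src, dst) basic-block edges, by Jaccard similarity.

def cfg_similarity(edges_a, edges_b):
    """Jaccard similarity of two CFG edge sets: |A & B| / |A | B|."""
    a, b = set(edges_a), set(edges_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two toy kernels: a loop containing a branch vs. a straight-line loop.
kernel_1 = {("entry", "loop"), ("loop", "if"), ("if", "loop"), ("loop", "exit")}
kernel_2 = {("entry", "loop"), ("loop", "loop"), ("loop", "exit")}

print(cfg_similarity(kernel_1, kernel_1))  # identical graphs -> 1.0
print(cfg_similarity(kernel_1, kernel_2))
```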

    Model-driven approach for supporting the mapping of parallel algorithms to parallel computing platforms

    Get PDF
    The trend from single-processor to parallel computer architectures has increased the importance of parallel computing. To support parallel computing, it is important to map parallel algorithms to a computing platform that consists of multiple parallel processing nodes. In general, different alternative mappings can be defined that perform differently with respect to quality requirements such as power consumption, efficiency, and memory usage. The mapping process can be carried out manually for platforms with a limited number of processing nodes. However, for exascale computing, in which hundreds of thousands of processing nodes are used, the mapping process soon becomes intractable. To assist the parallel computing engineer, we provide a model-driven approach to analyze, model, and select feasible mappings. We describe the developed toolset that implements the corresponding approach, together with the required metamodels and model transformations. We illustrate our approach for the well-known complete exchange algorithm in parallel computing. © 2013 Springer-Verlag
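Why manual mapping becomes intractable is easy to see: the number of task-to-node assignments grows exponentially. The sketch below, with an assumed toy cost model (it does not reflect the paper's actual metamodels or transformations), brute-forces the search that a model-driven approach must replace at scale:

```python
from itertools import product

def best_mapping(tasks, nodes, cost):
    """Brute-force search over all task->node mappings, minimizing a cost
    model. There are len(nodes)**len(tasks) candidates, which is exactly
    why exhaustive search is hopeless at exascale."""
    best, best_cost = None, float("inf")
    for assignment in product(nodes, repeat=len(tasks)):
        mapping = dict(zip(tasks, assignment))
        c = cost(mapping)
        if c < best_cost:
            best, best_cost = mapping, c
    return best, best_cost

# Assumed cost model: each communicating task pair pays 1 when the two
# tasks land on different nodes (a stand-in for communication overhead).
comm_pairs = [("t0", "t1"), ("t1", "t2")]
cost = lambda m: sum(1 for a, b in comm_pairs if m[a] != m[b])

mapping, c = best_mapping(["t0", "t1", "t2"], ["n0", "n1"], cost)
print(c)  # all tasks co-located -> cost 0
```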

    Performance Analysis Tool for HPC and Big Data Applications on Scientific Clusters

    Full text link
    Big data is prevalent in HPC. Many HPC projects rely on complex workflows to analyze terabytes or petabytes of data. These workflows often require running over thousands of CPU cores and performing simultaneous data accesses, data movements, and computation. It is challenging to analyze the performance of complex workflows over a large number of nodes with multiple parallel task executions, where the workflow data or the measurement data of the executions can reach terabytes or petabytes. To help identify performance bottlenecks and debug performance issues in large-scale scientific applications and scientific clusters, we have developed a performance analysis framework using state-of-the-art open-source big data processing tools. Our tool can ingest system logs and application performance measurements to extract key performance features, and apply sophisticated statistical tools and data mining methods to the performance data. It utilizes an efficient data processing engine to allow users to interactively analyze large amounts of different types of logs and measurements. To illustrate the functionality of the big data analysis framework, we conduct case studies on the workflows from an astronomy project known as the Palomar Transient Factory (PTF) and the job logs from a genome analysis scientific cluster.
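The "ingest logs, extract key performance features" step can be sketched in miniature. The log format and task names below are invented for illustration; the actual framework operates on real system logs and measurements at far larger scale with big data tooling:

```python
import re
from statistics import mean

# Hypothetical log lines; the real framework ingests system logs and
# application measurements, not this toy format.
log_lines = [
    "2024-01-01T00:00:01 task=ptf_photometry duration_s=12.4",
    "2024-01-01T00:00:02 task=ptf_photometry duration_s=15.1",
    "2024-01-01T00:00:03 task=genome_align duration_s=98.7",
]

def extract_features(lines):
    """Pull per-task durations out of raw log lines and summarize them --
    a minimal version of the feature-extraction step."""
    durations = {}
    for line in lines:
        m = re.search(r"task=(\S+) duration_s=([\d.]+)", line)
        if m:
            durations.setdefault(m.group(1), []).append(float(m.group(2)))
    return {task: mean(vals) for task, vals in durations.items()}

print(extract_features(log_lines))
```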

    Can We Monitor All Multithreaded Programs?

    Get PDF
    Runtime Verification (RV) is a lightweight formal method which consists in verifying that an execution of a program is correct with respect to a specification. The specification formalizes, with properties, the expected correct behavior of the system. Programs are instrumented to extract the necessary information from the execution and feed it to monitors tasked with checking the properties. From the perspective of a monitor, the system is a black box; the trace is the only system information provided. Parallel programs generally introduce an added level of complexity in program execution due to concurrency. A concurrent execution of a parallel program is best represented as a partial order. A large number of RV approaches generate monitors using formalisms that rely on a total order, while more recent approaches utilize formalisms that consider multiple traces. In this tutorial, we review some of the main RV approaches and tools that handle multithreaded Java programs. We discuss their assumptions, limitations, expressiveness, and suitability when tackling parallel programs such as producer-consumer and readers-writers. By analyzing the interplay between specification formalisms and concurrent executions of programs, we identify four questions RV practitioners may ask themselves to classify and determine the situations in which it is sound to use the existing tools and approaches.
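A total-order monitor of the kind discussed above can be sketched in a few lines. This toy monitor checks the producer-consumer property "never consume from an empty buffer" over a single interleaved trace; it is illustrative only, not one of the surveyed Java tools, and it highlights the tutorial's point: a verdict on one observed linearization says nothing about other interleavings of the same partial-order execution.

```python
# Minimal total-order runtime monitor for the producer-consumer property
# "never consume from an empty buffer". It sees only one interleaving
# (a flat trace of events), so a True verdict here does not rule out
# violations in other interleavings of the same concurrent execution.

def monitor(trace):
    """Return True if the property holds on this particular trace."""
    buffered = 0
    for event in trace:
        if event == "produce":
            buffered += 1
        elif event == "consume":
            if buffered == 0:
                return False  # violation: consumed from an empty buffer
            buffered -= 1
    return True

print(monitor(["produce", "consume", "produce", "consume"]))  # True
print(monitor(["consume", "produce"]))                        # False
```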

    Tools for OpenMP Application Development: The POST Project

    No full text
    OpenMP was recently proposed by a group of vendors as a programming model for shared memory parallel architectures. The growing popularity of such systems, and the rapid availability of product-strength compilers for OpenMP, seem to guarantee a broad take-up of this paradigm if appropriate tools for application development can be provided. POST is an EU-funded project that is developing a production tool, based on FORESYS from Simulog, which aims to reduce the human effort involved in the creation of OpenMP code. Additional research within the project focuses on alternative techniques to support OpenMP application development that target a broad variety of users. Functionality ranges from fully automatic strategies for novice users, the provision of parallelisation hints, and step-by-step strategies for porting code, to a range of transformations and source code analyses that may be used by experts, including the ability to create application-specific transformations. The work is accompanied by the development of OpenMP versions of several industrial applications.

    The Effect of Adding Concentrates with Different Crude Protein Levels to a Basal Ration on the Performance of Post-Weaning Boerawa Goats

    Full text link
    This research studied the effect of adding concentrate with different crude protein levels to a basal ration on the performance of post-weaning Boerawa goats. Twenty post-weaning Boerawa goats from Gisting, with an average initial weight of 18.25 ± 6.13 kg, were used. The research used a randomized block design with four treatments and five replications: R0 = basal ration, R1 = R0 (60%) + concentrate A (40%), R2 = R0 (60%) + concentrate B (40%), and R3 = R0 (60%) + concentrate C (40%). Fresh water was given to the goats ad libitum during the research. The results show a significant effect (P<0.05) on body weight gain, protein efficiency ratio, and feed conversion.

    Characterizing the Impact of Prefetching on Scientific Application Performance

    No full text
    Abstract—In order to better understand the impact of data prefetching on scientific application performance, this paper introduces two analysis techniques, one micro-architecture-centric and the other application-centric. We use these techniques to analyze representative full-scale production applications from five important Exascale target areas. We find that despite a great diversity in prefetching effectiveness across and even within applications, there is a strong correlation between regions where prefetching is most needed, due to high levels of memory traffic, and where it is most effective. We also observe that the application-centric analysis can explain many of the differences in prefetching effectiveness observed across the studied applications.

    A survey on optimizations towards best-effort hardware transactional memory

    No full text

    A Guidance Tool to Improve Memory Locality Reuse and To Exploit Hidden Parallelism in Loop Nests

    No full text
    In recent years, methods for analyzing and parallelizing sequential code using dependence analysis and loop transformations have been developed. These techniques have proved successful, and have been used either to move from sequential to parallel codes, or to improve the efficiency of existing parallel codes. Our research focuses on Fortran code optimisation for parallelisation in Shared Memory architectures by using analysis and source-to-source transformations. Our optimisation strategy, although designed for OpenMP directives, is sufficiently general to be used for pure Fortran code. In this paper we describe a tool, the Automatic Guidance Module (AGM). Its functionality ranges from fully automatic strategies for novice users, to a range of transformations and source code analyses that may be used by experts.
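To give a flavor of the dependence analysis such tools rely on, here is a sketch of the classic GCD test, one standard check for whether two array accesses in a loop can conflict across iterations. It is not AGM's actual implementation, just an illustration of the kind of analysis that decides whether a loop may be parallelized:

```python
from math import gcd

# Illustrative GCD dependence test. For a write to a[x*i + c1] and a
# read of a[y*i + c2] in a loop over i, an integer solution (and hence
# a possible dependence) can only exist if gcd(x, y) divides (c2 - c1).

def gcd_test(x, y, c1, c2):
    """Return True if a cross-iteration dependence is *possible*;
    False means the GCD test proves the accesses are independent."""
    g = gcd(x, y)
    return (c2 - c1) % g == 0

print(gcd_test(2, 2, 0, 1))  # a[2*i] vs a[2*i+1]: 2 does not divide 1 -> False
print(gcd_test(1, 1, 0, 1))  # a[i] vs a[i+1]: dependence possible -> True
```

Note the test is conservative in one direction only: False is a proof of independence, while True merely means a dependence cannot be ruled out.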